fix: backfill abstract from file content in vectorize_file #1343
Open
yc111233 wants to merge 1 commit into volcengine:main
When index_resource calls vectorize_file without a summary in summary_dict, the abstract field on the Context is set to an empty string. This means leaf (L2) records in the vector database end up with an empty abstract. Downstream, hierarchical_retriever passes these empty abstracts as documents to the rerank API, which causes rerank providers (e.g. DashScope qwen3-rerank) to return HTTP 400 because they reject empty document strings.

Fix: when vectorize_file reads raw file content for embedding and the abstract is still empty, backfill it with the first 200 characters of the file content. This ensures every L2 record has a non-empty abstract for reranking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
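The behaviour described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the project's code: the `Context` class here is a stand-in with a single field, and `backfill_abstract` and `ABSTRACT_MAX_CHARS` are hypothetical names chosen for illustration. Only the `summary_dict.get("summary", "")` lookup and the 200-character limit come from the PR text.

```python
from dataclasses import dataclass

# Length stated in the PR; the constant name itself is hypothetical.
ABSTRACT_MAX_CHARS = 200


@dataclass
class Context:
    """Stand-in for the project's Context, reduced to the field discussed here."""
    abstract: str = ""


def backfill_abstract(context: Context, content: str) -> Context:
    """Fill an empty abstract with the leading characters of the file content."""
    if not context.abstract and content:
        context.abstract = content[:ABSTRACT_MAX_CHARS]
    return context


# Root cause: only a "name" key is passed, so the lookup falls back to "".
summary_dict = {"name": "notes.txt"}
summary = summary_dict.get("summary", "")
ctx = Context(abstract=summary)  # abstract is "" at this point

# Fix: once the raw content has been read for embedding, reuse its prefix.
file_content = "lorem " * 100  # 600 characters of placeholder text
backfill_abstract(ctx, file_content)
print(len(ctx.abstract))  # 200
```

An abstract that is already populated is left untouched, so files that do arrive with a summary keep it.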
Summary
- When `index_resource` calls `vectorize_file` without a summary in `summary_dict`, the `abstract` field on the `Context` is set to an empty string.
- Leaf (L2) records in the vector database end up with empty `abstract` fields.
- `hierarchical_retriever` passes these empty abstracts as documents to the rerank API, causing rerank providers (e.g. DashScope qwen3-rerank) to return HTTP 400 because they reject empty document strings.

Root cause
`index_resource` (line 371) calls `vectorize_file(summary_dict={"name": file_name})`, with no `"summary"` key. Inside `vectorize_file`, `summary = summary_dict.get("summary", "")` resolves to `""`, which becomes `Context(abstract="")`. The file content is read for embedding but never used to populate `abstract`.

Fix
When `vectorize_file` reads raw file content for embedding and the abstract is still empty, backfill it with the first 200 characters of the file content.

Impact
Every L2 record created via `index_resource` will now have a non-empty `abstract`, preventing rerank 400 errors.

Test plan
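One way the non-empty `abstract` claim could be checked after indexing is sketched here, under stated assumptions. The record shape (plain dicts with `uri` and `abstract` keys) and the helper name `records_with_empty_abstract` are hypothetical and do not reflect the project's storage API. The helper only shows the kind of assertion a test might make.

```python
from typing import Iterable, List, Mapping


def records_with_empty_abstract(records: Iterable[Mapping[str, object]]) -> List[str]:
    """Return the URIs of records whose abstract is missing or blank."""
    missing = []
    for record in records:
        abstract = record.get("abstract") or ""
        if not str(abstract).strip():
            missing.append(str(record.get("uri", "<unknown>")))
    return missing


# Hypothetical records, as they might look after indexing a small directory.
sample = [
    {"uri": "docs/a.txt", "abstract": "First 200 characters of a.txt"},
    {"uri": "docs/b.txt", "abstract": ""},
]
empty = records_with_empty_abstract(sample)
print(empty)  # ['docs/b.txt']
```

A test for this change would expect the returned list to be empty for every record created from a non-empty text file.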
- Run `index_resource` on a directory with text files
- Verify that the resulting L2 records have a non-empty `abstract`

🤖 Generated with Claude Code